AITopics | automatic speech recognition

Collaborating Authors

automatic speech recognition

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

FastCorrect: Fast Error Correction with Edit Alignment for Automatic Speech Recognition

Neural Information Processing SystemsDec-24-2025, 19:18:03 GMT

Error correction techniques have been used to refine the output sentences from automatic speech recognition (ASR) models and achieve a lower word error rate (WER) than original ASR outputs. Previous works usually use a sequence-to-sequence model to correct an ASR output sentence autoregressively, which causes large latency and cannot be deployed in online ASR services. A straightforward solution to reduce latency, inspired by non-autoregressive (NAR) neural machine translation, is to use an NAR sequence generation model for ASR error correction, which, however, comes at the cost of significantly increased ASR error rate. In this paper, observing distinctive error patterns and correction operations (i.e., insertion, deletion, and substitution) in ASR, we propose FastCorrect, a novel NAR error correction model based on edit alignment. In training, FastCorrect aligns each source token from an ASR output sentence to the target tokens from the corresponding ground-truth sentence based on the edit distance between the source and target sentences, and extracts the number of target tokens corresponding to each source token during edition/correction, which is then used to train a length predictor and to adjust the source tokens to match the length of the target sentence for parallel generation. In inference, the token number predicted by the length predictor is used to adjust the source tokens for target sequence generation. Experiments on the public AISHELL-1 dataset and an internal industrial-scale ASR dataset show the effectiveness of FastCorrect for ASR error correction: 1) it speeds up the inference by 6-9 times and maintains the accuracy (8-14% WER reduction) compared with the autoregressive correction model; and 2) it outperforms the popular NAR models adopted in neural machine translation and text edition by a large margin.

error correction, fast error correction, fastcorrect, (10 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Neural Information Processing SystemsDec-24-2025, 02:01:22 GMT

The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro and micro-architecture of Conformer, we propose Squeezeformer which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules followed up by feed-forward module instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise down-sampling layer to efficiently sub-sample the input signal. Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER) on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online.

automatic speech recognition, efficient transformer, squeezeformer, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.39)

Add feedback

ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

Lin, Ye Bhone, Aung, Thura, Thu, Ye Kyaw, Oo, Thazin Myint

arXiv.org Artificial IntelligenceNov-27-2025

Abstract--This paper investigates sequence-to-sequence T ransformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IP A and alignment information. T o our knowledge, this is the first study addressing ASR error correction specifically for Burmese. W e evaluate five ASR backbones and show that our ASR Error Correction (AEC) approaches consistently improve word-and character-level accuracy over baseline outputs. The proposed AEC model, combining IP A and alignment features, reduced the average WER of ASR models from 51.56 to 39.82 before augmentation (and 51.56 to 43.59 after augmentation) and improving chrF++ scores from 0.5864 to 0.627, demonstrating consistent gains over the baseline ASR outputs without AEC. Our results highlight the robustness of AEC and the importance of feature design for improving ASR outputs in low-resource settings.

data quality, error correction, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2511.21088

Country:

Europe (0.46)
Asia > Myanmar (0.16)
Asia > Thailand (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Data Science > Data Quality > Data Cleaning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition

Bian, Wesley, Lin, Xiaofeng, Cheng, Guang

arXiv.org Artificial IntelligenceNov-26-2025

Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages, primarily due to the abundance of available training data. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. In this work, we introduce a novel data augmentation technique for speech corpora designed to mitigate this gap. Through comprehensive experiments, we demonstrate that our method significantly improves the performance of automatic speech recognition systems on low-resource languages. Furthermore, we show that our approach outperforms existing augmentation strategies, offering a practical solution for enhancing speech technology in underrepresented linguistic communities.

artificial intelligence, machine learning, speech recognition, (13 more...)

arXiv.org Artificial Intelligence

2511.20534

Country: North America > United States > California > Los Angeles County > Los Angeles (0.29)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation

Al-Kharusi, Mohammed Hilal, Hayat, Khizar, Ruqeishi, Khalil Bader Al, Lone, Haroon Rashid

arXiv.org Artificial IntelligenceNov-14-2025

The art and science of Quranic recitation (Tajweed), a discipline governed by meticulous phonetic, rhythmic, and theological principles, confronts substantial educational challenges in today's digital age. Although modern technology offers unparalleled opportunities for learning, existing automated systems for evaluating recitation have struggled to gain broad acceptance or demonstrate educational effectiveness. This literature review examines this crucial disparity, offering a thorough analysis of scholarly research, digital platforms, and commercial tools developed over the past twenty years. Our analysis uncovers a fundamental flaw in current approaches that adapt Automatic Speech Recognition (ASR) systems, which emphasize word identification over qualitative acoustic evaluation. These systems suffer from limitations such as reliance on biased datasets, demographic disparities, and an inability to deliver meaningful feedback for improvement. Challenging these data-centric methodologies, we advocate for a paradigm shift toward a knowledge-based computational framework. By leveraging the unchanging nature of the Quranic text and the well-defined rules of Tajweed, we propose that an effective evaluation system should be built upon rule-based acoustic modeling centered on canonical pronunciation principles and articulation points (Makhraj), rather than depending on statistical patterns derived from flawed or biased data. The review concludes that the future of automated Quranic recitation assessment lies in hybrid systems that combine linguistic expertise with advanced audio processing. Such an approach paves the way for developing reliable, fair, and pedagogically effective tools that can authentically assist learners across the globe.

data mining, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2510.12858

Country:

Asia > Middle East (0.92)
Africa > Middle East > Egypt (0.46)

Genre:

Instructional Material (0.93)
Overview (0.88)
Research Report > New Finding (0.67)
Research Report > Promising Solution (0.46)

Industry:

Education > Educational Setting > Online (0.93)
Education > Educational Technology > Educational Software > Computer Based Training (0.68)
Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
(4 more...)

Add feedback

CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition

Sung, Hung-Yang, Wang, Chien-Chun, Huang, Kuan-Tang, Lo, Tien-Hong, Tsao, Yu-Sheng, Hsu, Yung-Chang, Chen, Berlin

arXiv.org Artificial IntelligenceNov-11-2025

Automatic speech recognition (ASR) for low-resource languages such as Taiwanese Hokkien is difficult due to the scarcity of annotated data. However, direct fine-tuning on Han-character transcriptions often fails to capture detailed phonetic and tonal cues, while training only on roman-ization lacks lexical and syntactic coverage. In addition, prior studies have rarely explored staged strategies that integrate both annotation types. To address this gap, we present CLiFT-ASR, a cross-lingual fine-tuning framework that builds on Mandarin HuBERT models and progressively adapts them to Taiwanese Hokkien. The framework employs a two-stage process in which it first learns acoustic and tonal representations from phonetic Tai-lo annotations and then captures vocabulary and syntax from Han-character transcriptions. This progressive adaptation enables effective alignment between speech sounds and orthographic structures. Experiments on the TAT-MOE corpus demonstrate that CLiFT-ASR achieves a 24.88% relative reduction in character error rate (CER) compared with strong baselines. The results indicate that CLiFT-ASR provides an effective and parameter-efficient solution for Taiwanese Hokkien ASR and that it has potential to benefit other low-resource language scenarios.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2511.0686

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.35)

Add feedback

WST: Weakly Supervised Transducer for Automatic Speech Recognition

Gao, Dongji, Liao, Chenda, Liu, Changliang, Wiesner, Matthew, Garcia, Leibny Paola, Povey, Daniel, Khudanpur, Sanjeev, Wu, Jian

arXiv.org Artificial IntelligenceNov-7-2025

The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2511.04035

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Open Source State-Of-the-Art Solution for Romanian Speech Recognition

Pirlogeanu, Gabriel, Georgescu, Alexandru-Lucian, Cucu, Horia

arXiv.org Artificial IntelligenceNov-6-2025

Abstract--In this work, we present a new state-of-the-art Romanian Automatic Speech Recognition (ASR) system based on NVIDIA's FastConformer architecture--explored here for the first time in the context of Romanian. We train our model on a large corpus of, mostly, weakly supervised transcriptions, totaling over 2,600 hours of speech. Leveraging a hybrid decoder with both Connectionist T emporal Classification (CTC) and T oken-Duration Transducer (TDT) branches, we evaluate a range of decoding strategies including greedy, ALSD, and CTC beam search with a 6-gram token-level language model. Our system achieves state-of-the-art performance across all Romanian evaluation benchmarks, including read, spontaneous, and domain-specific speech, with up to 27% relative WER reduction compared to previous best-performing systems. In addition to improved transcription accuracy, our approach demonstrates practical decoding efficiency, making it suitable for both research and deployment in low-latency ASR applications. Automatic Speech Recognition (ASR) has undergone a paradigm shift over the past decade, driven by the rise of end-to-end architectures and the increasing availability of large-scale datasets. Models such as RNN-Transducer, Transformer-Transducer, wav2vec, Whisper, Conformer [1] have dramatically improved recognition accuracy across many languages. Most recently, Speech Large Language Models (SpeechLLMs) [2] have further advanced the field by integrating multimodal and multilingual supervision at unprecedented scale.

artificial intelligence, natural language, speech recognition, (14 more...)

arXiv.org Artificial Intelligence

2511.03361

Country: Europe (0.46)

Genre: Research Report > Promising Solution (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

The Analysis of Lexical Errors in Machine Translation from English into Romanian

Stamatie, Angela

arXiv.org Artificial IntelligenceNov-5-2025

The research explores error analysis in the performance of translating by Machine Translation from English into Romanian, and it focuses on lexical errors found in texts which include official information, provided by the World Health Organization (WHO), the Gavi Organization, by the patient information leaflet (the information about the active ingredients of the vaccines or the medication, the indications, the dosage instructions, the storage instructions, the side effects and warning, etc.). All of these texts are related to C ovid - 19 and have been translated by Google Translate, a multilingual Machine Translation that was created by Google. In the last decades, Google has actively work ed to develop a more accurate and fluent automatic translation system. This research, specifically focused on improving Google Translate, aims to enhance the overall quality of Machine Translation by achieving better lexical selection and by reducing errors. The investigation involves a comprehensive analysis of 230 texts that have been translated from English into Romanian.

machine learning, natural language, oxford lingua english romanian dictionary, (21 more...)

arXiv.org Artificial Intelligence

2511.02587

Country:

Asia (1.00)
Europe > United Kingdom > England (0.28)
North America > United States > Massachusetts (0.27)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)
Instructional Material (1.00)
(2 more...)

Industry:

Health & Medicine > Therapeutic Area > Vaccines (1.00)
Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
(11 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

A Neural Model for Contextual Biasing Score Learning and Filtering

Huang, Wanting, Wang, Weiran

arXiv.org Artificial IntelligenceOct-29-2025

Contextual biasing improves automatic speech recognition (ASR) by integrating external knowledge, such as user-specific phrases or entities, during decoding. In this work, we use an attention-based biasing decoder to produce scores for candidate phrases based on acoustic information extracted by an ASR encoder, which can be used to filter out unlikely phrases and to calculate bonus for shallow-fusion biasing. We introduce a per-token discriminative objective that encourages higher scores for ground-truth phrases while suppressing distractors. Experiments on the Librispeech biasing benchmark show that our method effectively filters out majority of the candidate phrases, and significantly improves recognition accuracy under different biasing conditions when the scores are used in shallow fusion biasing. Our approach is modular and can be used with any ASR system, and the filtering mechanism can potentially boost performance of other biasing methods.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.23849

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.73)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Add feedback